In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Linear Regression is a supervised learning technique for predicting a response or target $\mathbf{y}$ from a linear combination of a set of $D$ predictors or features, $\mathbf{x} = (1, x_1, \dots, x_D)$, such that
\begin{equation*} y = \beta_0 + \beta_1 x_1 + \dots + \beta_D x_D = \mathbf{x}^T\mathbf{\beta} \end{equation*}
Data We Observe
\begin{eqnarray*} y &:& \mbox{response or target variable} \\ \mathbf{x} &:& \mbox{set of $D$ predictor or explanatory variables } \mathbf{x}^T = (1, x_1, \dots, x_D) \end{eqnarray*}
What We Are Trying to Learn
\begin{eqnarray*} \beta^T = (\beta_0, \beta_1, \dots, \beta_D) &:& \mbox{Parameter values for a "best" prediction of } y \rightarrow \hat y \end{eqnarray*}
Outcomes We Are Trying to Predict
\begin{eqnarray*} \hat y &:& \mbox{Prediction for the data that we observe} \end{eqnarray*}
Matrix Notation
\begin{equation*} \mathbf{Y} = \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_N \end{array} \right) \qquad \mathbf{X} = \left( \begin{array}{ccccc} 1 & x_{1,1} & x_{1,2} & \dots & x_{1,D} \\ 1 & x_{2,1} & x_{2,2} & \dots & x_{2,D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{i,1} & x_{i,2} & \dots & x_{i,D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N,1} & x_{N,2} & \dots & x_{N,D} \end{array} \right) \qquad \beta = \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_D \end{array} \right) \end{equation*}
Why is it called Linear Regression?
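As a quick, purely illustrative sketch of this notation (the numbers below are made up, not from any dataset): each row of $\mathbf{X}$ is one observation $(1, x_1, \dots, x_D)$, and the predictions are just the matrix product $\mathbf{X}\beta$.
import numpy as np
X = np.array([[1.0, 2.0, 0.5],      # N = 3 observations, each row is (1, x_1, x_2)
              [1.0, 1.0, 3.0],
              [1.0, 4.0, 1.5]])
beta = np.array([1.0, 0.5, -2.0])   # (beta_0, beta_1, beta_2)
y_hat = np.dot(X, beta)             # prediction x_i^T beta for every observation
print(y_hat)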
It is often asked why this is called linear regression when we can use polynomial terms and other transformations as predictors. For example,
\begin{equation*} y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + \beta_4 \sin(x_1) \end{equation*}
is still a linear regression, even though it contains polynomial and trigonometric transformations of $x_1$. This is because the term linear refers to the learned coefficients $\beta$, not the input features $\mathbf{x}$.
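To make this concrete, here is a small sketch (not part of the original exercise; the coefficients and noise level are arbitrary) showing that once the transformed features are placed in the design matrix, the problem is ordinary linear least squares in $\beta$:
import numpy as np
rng = np.random.RandomState(0)
x1 = rng.uniform(-2, 2, 100)
y = 1.0 + 0.5*x1 - 0.3*x1**2 + 0.1*x1**3 + 2.0*np.sin(x1) + rng.normal(0, 0.1, 100)
# Columns are transformations of x1, but y_hat = X.dot(beta) is still linear in beta
X = np.column_stack([np.ones_like(x1), x1, x1**2, x1**3, np.sin(x1)])
beta_hat = np.linalg.lstsq(X, y)[0]   # one estimated coefficient per transformed column
print(beta_hat)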
How can we Learn $\beta$?
Linear Regression can be thought of as an optimization problem where we want to minimize some loss function of the error between the prediction $\hat y$ and the observed data $y$.
\begin{eqnarray*} error_i &=& y_i - \hat y_i \\ &=& y_i - \mathbf{x_i^T}\beta \end{eqnarray*}
Let's see what these errors look like...
Below we show a simulation where the observed $y$ was generated such that $y = 1 + 0.5 x + \epsilon$ and $\epsilon \sim N(0,1)$. If we assume that we know the truth, $y = 1 + 0.5 x$, then the red lines show the error (or residuals) between the observations and the truth.
In [4]:
#############################################################
# Demonstration - What do Residuals Look Like
#############################################################
np.random.seed(33) # Setting a seed allows reproducibility of experiments
beta0 = 1 # Creating an intercept
beta1 = 0.5 # Creating a slope
# Randomly sampling data points
x_example = np.random.uniform(0,5,10)
y_example = beta0 + beta1 * x_example + np.random.normal(0,1,10)
line1 = beta0 + beta1 * np.arange(-1, 6)
f = plt.figure()
plt.scatter(x_example,y_example) # Plotting observed data
plt.plot(np.arange(-1,6), line1) # Plotting the true line
for i, xi in enumerate(x_example):
    plt.vlines(xi, beta0 + beta1 * xi, y_example[i], colors='red') # Plotting Residual Lines
plt.annotate('Error or "residual"', xy = (x_example[5], 2), xytext = (-1.5, 2.1),
             arrowprops=dict(width=1, headwidth=7, facecolor='black', shrink=0.01))
f.set_size_inches(10,5)
plt.title('Errors in Linear Regression')
plt.show()
Choosing a Loss Function to Optimize
Historically, Linear Regression has been solved using the method of Least Squares, where we are interested in minimizing the mean squared error loss function of the form:
\begin{eqnarray*} Loss(\beta) = MSE &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2 \\ &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{x_i^T}\beta)^2 \\ \end{eqnarray*}
where $N$ is the total number of observations. Other loss functions can be used, but mean squared error (also referred to as the sum of squared residuals in other texts) has very nice properties for closed form solutions. We will use this loss function both for gradient descent and to derive a closed form matrix solution.
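As a minimal sketch of this loss (assuming X already includes the column of ones and betas is a candidate coefficient vector; the function name mse_loss is just illustrative):
def mse_loss(X, y, betas):
    residuals = y - np.dot(X, betas)   # y_i - x_i^T beta for every observation
    return np.mean(residuals ** 2)     # (1/N) * sum of squared residuals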
We'll use a baseball salary dataset to investigate Linear Regression. The dataset consists of 337 observations and 18 variables from the set of Major League Baseball players who played at least one game in both the 1991 and 1992 seasons, excluding pitchers. The dataset contains the 1992 salaries for that population, along with performance measures for each player. Four categorical variables indicate how free each player was to move to other teams.
Filename: data/baseball.dat.txt
What we will try to predict
We will attempt to predict a player's salary based upon predictor variables such as Hits, OBP, Walks, RBIs, etc.
Loading data from csv files in Python can be done in a few different ways. The numpy package has a function called 'genfromtxt' that can read csv files, while the pandas library has the 'read_csv' function. Remember that we have imported numpy and pandas as np and pd respectively at the top of this notebook. An example using pandas is as follows:
pd.read_csv(filename, **args)
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_csv.html
Student Action - Load the 'baseball.dat.txt' file into a variable called 'baseball'. Then use baseball.head() to view the first few entries
In [7]:
#######################################################################
# Student Action - Load the file 'baseball.dat.txt' using pd.read_csv()
#######################################################################
baseball = pd.read_csv('data/baseball.dat.txt')
baseball.head()   # view the first few entries
Crash Course: Plotting with Matplotlib
At the top of this notebook we have imported the package pyplot as plt from the matplotlib library. matplotlib is a great package for creating simple plots in Python. Below is a link to their tutorial for basic plotting.
Tutorials
Simple Plotting
import matplotlib.pyplot as plt
fig = plt.figure()
- plt.plot(x,y) - A line plot
- plt.scatter(x,y) - Scatter Plots
- plt.hist(x) - Histogram of a variable
- plt.xlabel('String')
- plt.ylabel('String')
- plt.title('String')
- fig.set_size_inches(width, height)
- plt.show()
Plots are displayed inline in this notebook because of the magic command %matplotlib inline.
Transforming Variables
We'll talk more about numpy later, but to perform the logarithmic transformation use the command np.log($array$).
In [8]:
#############################################################
# Demonstration - Plot a Histogram of Hits
#############################################################
f = plt.figure()
plt.hist(baseball['Hits'], bins=15)
plt.xlabel('Number of Hits')
plt.ylabel('Frequency')
plt.title('Histogram of Number of Hits')
f.set_size_inches(10, 5)
plt.show()
In [9]:
#############################################################
# Student Action - import matplotlib.pyplot
#                - Plot a Histogram of log(Salaries)
#############################################################
f = plt.figure()
plt.hist(np.log(baseball['Salary']), bins = 15)
plt.xlabel('log(Salaries)')
plt.ylabel('Frequency')
plt.title('Histogram of log Salaries')
f.set_size_inches(10, 5)
plt.show()
In [10]:
#############################################################
# Student Action - Plot a Scatter Plot of Salaries vs. Hits
#############################################################
f = plt.figure()
plt.scatter(baseball['Hits'], np.log(baseball['Salary']))
plt.xlabel('Hits')
plt.ylabel('log(Salaries)')
plt.title('Scatter Plot of Salaries vs. Hits')
f.set_size_inches(10, 5)
plt.show()
In Linear Regression we are interested in optimizing our loss function $Loss(\beta)$ to find the optimal $\beta$ such that
\begin{eqnarray*} \hat \beta &=& \arg \min_{\beta} \frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{x_i^T}\beta)^2 \\ &=& \arg \min_{\beta} \frac{1}{N} \mathbf{(Y - X\beta)^T (Y - X\beta)} \\ \end{eqnarray*}
One optimization technique called 'Gradient Descent' is useful for finding an optimal solution to this problem. Gradient descent is a first order optimization technique that attempts to find a local minimum of a function by updating its position with steps proportional to the negative gradient of the function at the current point. The gradient at a point indicates the direction of steepest ascent, so its negative is the best local guess for the direction in which the algorithm should move.
If we consider $\theta$ to be some parameters we are interested in optimizing, $L(\theta)$ to be our loss function, and $\alpha$ to be our step size proportionality, then we have the following algorithm:
Algorithm - Gradient Descent
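In words: start from an initial guess $\theta^{(0)}$ and repeatedly update $\theta \leftarrow \theta - \alpha \nabla L(\theta)$ until the updates become negligibly small. A minimal sketch (the names grad, theta0, alpha, tol, and max_iter are illustrative, not from the original):
def gradient_descent(grad, theta0, alpha, tol=1e-8, max_iter=100000):
    # Generic gradient descent: theta <- theta - alpha * grad(theta)
    theta = theta0
    for _ in range(max_iter):
        step = alpha * grad(theta)
        if np.linalg.norm(step) < tol:   # stop when the update becomes tiny
            break
        theta = theta - step
    return theta
This mirrors the stopping rule used in the baseball exercise below, where the loop ends once the size of the step falls below a tolerance.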
For our problem at hand we therefore need to find $\nabla L(\beta)$. The derivative of $L(\beta)$ with respect to the $j^{th}$ coefficient is:
\begin{eqnarray*} \frac{\partial L(\beta)}{\partial \beta_j} = -\frac{2}{N}\sum_{i=1}^{N} (y_i - \mathbf{x_i^T}\beta)\cdot{x_{i,j}} \end{eqnarray*}
In matrix notation this can be written:
\begin{eqnarray*} Loss(\beta) &=& \frac{1}{N}\mathbf{(Y - X\beta)^T (Y - X\beta)} \\ &=& \frac{1}{N}\mathbf{(Y^TY} - 2 \mathbf{\beta^T X^T Y + \beta^T X^T X\beta)} \\ \nabla_{\beta} L(\beta) &=& \frac{1}{N} (-2 \mathbf{X^T Y} + 2 \mathbf{X^T X \beta)} \\ &=& -\frac{2}{N} \mathbf{X^T (Y - X \beta)} \\ \end{eqnarray*}
In [16]:
###################################################################
# Student Action - Programming the Gradient
###################################################################
def gradient(X, y, betas):
    #****************************
    # Your code here!
    return -2.0/len(X)*np.dot(X.T, y - np.dot(X, betas))
    #****************************
#########################################################
# Testing your gradient function
#########################################################
np.random.seed(33)
X = pd.DataFrame({'ones': 1,
                  'X1': np.random.uniform(0,1,50)})
y = np.random.normal(0,1,50)
betas = np.array([-1,4])
grad_expected = np.array([ 2.98018138, 7.09758971])
grad = gradient(X,y,betas)
try:
    np.testing.assert_almost_equal(grad, grad_expected)
    print "Test Passed!"
except AssertionError:
    print "*******************************************"
    print "ERROR: Something isn't right... Try Again!"
    print "*******************************************"
Student Action - Use your Gradient Function to complete the Gradient Descent for the Baseball Dataset
We have set up all the necessary matrices and starting values. In the designated section below, code the algorithm from the previous section.
In [18]:
# Setting up our matrices
Y = np.log(baseball['Salary'])
N = len(Y)
X = pd.DataFrame({'ones' : np.ones(N),
                  'Hits' : baseball['Hits']})
p = len(X.columns)
# Initializing the beta vector
betas = np.array([0.015,5.13])
# Initializing Alpha
alph = 0.00001
# Setting a tolerance
tol = 1e-8
###################################################################
# Student Action - Programming the Gradient Descent Algorithm Below
###################################################################
niter = 1.
while (alph*np.linalg.norm(gradient(X,Y,betas)) > tol) and (niter < 20000):
    #****************************
    # Your code here!
    betas -= alph*gradient(X, Y, betas)
    niter += 1
    #****************************
print niter, betas
try:
    beta_expected = np.array([ 0.01513772, 5.13000121])
    np.testing.assert_almost_equal(betas, beta_expected)
    print "Test Passed!"
except AssertionError:
    print "*******************************************"
    print "ERROR: Something isn't right... Try Again!"
    print "*******************************************"
Comments on Gradient Descent
See the Supplementary Material for any help necessary with scripting this in Python.
Let's try to find the value of $x$ that maximizes the following function:
\begin{equation} f(x) = w \times \frac{1}{\sqrt{2\pi \sigma_1^2}} \exp \left( - \frac{(x-\mu_1)^2}{2\sigma_1^2}\right) + (1-w) \times \frac{1}{\sqrt{2\pi \sigma_2^2}} \exp \left( - \frac{(x-\mu_2)^2}{2\sigma_2^2}\right) \end{equation}
where $w=0.3$, $\mu_1 = 6.5$, $\sigma_1^2 = 3$ and $\mu_2 = -1$, $\sigma_2^2 = 10$ (the values used in the code below).
Let's visualize this function
In [19]:
x1 = np.arange(-10, 15, 0.05)
mu1 = 6.5
var1 = 3
mu2 = -1
var2 = 10
weight = 0.3
def mixed_normal_distribution(x, mu1, var1, mu2, var2):
    # Weighted mixture of two normal densities
    pdf1 = np.exp( - (x - mu1)**2 / (2*var1) ) / np.sqrt(2 * np.pi * var1)
    pdf2 = np.exp( - (x - mu2)**2 / (2*var2) ) / np.sqrt(2 * np.pi * var2)
    return weight * pdf1 + (1-weight) * pdf2
pdf = mixed_normal_distribution(x1, mu1, var1, mu2, var2)
fig = plt.figure()
plt.plot(x1, pdf)
fig.set_size_inches([10,5])
plt.show()
In [20]:
def mixed_gradient(x, mu1, var1, mu2, var2):
    # Negative of df/dx for the mixture, so the descent-style update below performs ascent on f
    grad_pdf1 = np.exp( - (x - mu1)**2 / (2*var1) ) * ((x-mu1)/var1) / np.sqrt(2 * np.pi * var1)
    grad_pdf2 = np.exp( - (x - mu2)**2 / (2*var2) ) * ((x-mu2)/var2) / np.sqrt(2 * np.pi * var2)
    return weight * grad_pdf1 + (1-weight)*grad_pdf2
# Initialize X
x = 3.25
# Initializing Alpha
alph = 5
# Setting a tolerance
tol = 1e-8
niter = 1.
results = []
while (alph*np.linalg.norm(mixed_gradient(x, mu1, var1, mu2, var2)) > tol) and (niter < 500000):
    #****************************
    results.append(x)   # record the path taken
    x = x - alph * mixed_gradient(x, mu1, var1, mu2, var2)
    niter += 1
    #****************************
print x, niter
if niter < 500000:
    exes = mixed_normal_distribution(np.array(results), mu1, var1, mu2, var2)
    fig = plt.figure()
    plt.plot(x1, pdf)
    plt.plot(results, exes, color='red', marker='x')
    plt.ylim([0,0.1])
    fig.set_size_inches([20,10])
    plt.show()
From the last section, you may have recognized that we could actually solve for $\beta$ directly.
\begin{eqnarray*} Loss(\beta) &=& \frac{1}{N}\mathbf{(Y - X\beta)^T (Y - X\beta)} \\ \nabla_{\beta} L(\beta) &=& \frac{1}{N} (-2 \mathbf{X^T Y} + 2 \mathbf{X^T X \beta}) \\ \end{eqnarray*}
Setting the gradient to zero,
\begin{eqnarray*} -2 \mathbf{X^T Y} + 2 \mathbf{X^T X} \beta &=& 0 \\ \mathbf{X^T X \beta} &=& \mathbf{X^T Y} \\ \end{eqnarray*}
If we assume that the columns of $\mathbf{X}$ are linearly independent, then
\begin{eqnarray*} \hat \beta &=& \mathbf{(X^T X)^{-1}X^T Y} \\ \end{eqnarray*}
This is called the Ordinary Least Squares (OLS) Estimator.
Student Action - Solve for $\hat \beta$ directly using OLS on the Baseball Dataset - 10 mins
In [21]:
# Setting up our matrices
y = np.log(baseball['Salary'])
N = len(y)
X = pd.DataFrame({'ones' : np.ones(N),
                  'Hits' : baseball['Hits']})
#############################################################
# Student Action - Program a closed form solution for
# Linear Regression. Compare with Gradient
# Descent.
#############################################################
def solve_linear_regression(X, y):
    #****************************
    return np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))
    #****************************
betas = solve_linear_regression(X,y)
try:
    beta_expected = np.array([ 0.01513353, 5.13051682])
    np.testing.assert_almost_equal(betas, beta_expected)
    print "Betas: ", betas
    print "Test Passed!"
except AssertionError:
    print "*******************************************"
    print "ERROR: Something isn't right... Try Again!"
    print "*******************************************"
Comments on solving the loss function directly
Disadvantage: inverting a matrix can be a computationally expensive operation.
If we have a design matrix with $N$ observations and $D$ predictors, then $\mathbf{X}$ is $(N \times D)$, and it follows that
\begin{eqnarray*} \mathbf{X^TX} \mbox{ is of size } (D \times N) \times (N \times D) = (D \times D) \\ \end{eqnarray*}
If a matrix is of size $(D\times D)$, the computational cost of inverting it is $O(D^3)$.
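In practice the inverse is rarely formed explicitly; solving the normal equations $\mathbf{X^TX}\beta = \mathbf{X^TY}$ as a linear system, or using a least-squares routine, is typically cheaper and more numerically stable. A sketch of two common alternatives (reusing the X and y defined in the cell above):
import numpy as np
XtX = np.dot(X.T, X)
Xty = np.dot(X.T, y)
beta_solve = np.linalg.solve(XtX, Xty)   # solve the normal equations without inverting
beta_lstsq = np.linalg.lstsq(X, y)[0]    # QR/SVD-based least squares on X directly
print(beta_solve)
print(beta_lstsq)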
As we've shown in the previous two exercises, when coding these algorithms ourselves we must consider many things, such as selecting step sizes and the computational cost of inverting matrices. For many applications, though, packages have been created that take many of these parameter selections into consideration. We now turn our attention to the Python package for Machine Learning called 'scikit-learn'.
Included is the documentation for the scikit-learn implementation of Ordinary Least Squares from their linear models package
Generalized Linear Models Documentation:
LinearRegression Class Documentation:
From this we see that we'll need to import the module linear_model using the following:
from sklearn import linear_model
Let's examine an example using the LinearRegression class from scikit-learn. We'll continue with the simulated data from the beginning of the exercise.
Notes
- linear_model.LinearRegression() creates an object of class sklearn.linear_model.base.LinearRegression with defaults:
  - fit_intercept = True: automatically adds a column vector of ones for an intercept
  - normalize = False: does not normalize the input predictors
  - copy_X = True: copies X rather than overwriting the input data
  - n_jobs = 1: the number of jobs to use for the computation. If -1, all CPUs are used. This only provides a speedup for n_targets > 1 and sufficiently large problems.
- .fit(X, y) can be used to fit the model; X can be passed as a pd.DataFrame()
- .predict(X) can be used to make predictions
- Use .coef_ for the coefficients of the predictors and .intercept_ for $\beta_0$
In [22]:
#############################################################
# Demonstration - scikit-learn with Regression Example
#############################################################
from sklearn import linear_model
lmr = linear_model.LinearRegression()
lmr.fit(pd.DataFrame(x_example), pd.DataFrame(y_example))
xTest = pd.DataFrame(np.arange(-1,6))
yHat = lmr.predict(xTest)
f = plt.figure()
plt.scatter(x_example, y_example)
p1, = plt.plot(np.arange(-1,6), line1)
p2, = plt.plot(xTest, yHat)
plt.legend([p1, p2], ['y = 1 + 0.5x', 'OLS Estimate'], loc=2)
f.set_size_inches(10,5)
plt.show()
print lmr.coef_, lmr.intercept_
In [24]:
#######################################################################
# Student Action - Use scikit-learn to calculate the beta coefficients
#
# Note: You no longer need the intercept column in your X matrix for
#       scikit-learn. It will add that column automatically.
#######################################################################
lmr2 = linear_model.LinearRegression(fit_intercept=True)
lmr2.fit(pd.DataFrame(baseball['Hits']), np.log(baseball['Salary']))
xtest = np.arange(0,200)
ytest = lmr2.intercept_ + lmr2.coef_*xtest
f = plt.figure()
plt.scatter(baseball['Hits'], np.log(baseball['Salary']))
plt.plot(xtest, ytest, color='r', linewidth=3)
f.set_size_inches(10,5)
plt.show()
print lmr2.coef_, lmr2.intercept_
In the real world, Linear Regression for predictive modeling doesn't end once you've fit the model. Models are often fit and used to predict user behavior, to quantify business metrics, or sometimes to identify cat faces for internet points. In that pursuit, it isn't really interesting to fit a model and assess its performance on data that has already been observed. The real interest lies in how it predicts future observations!
Often, then, we may be susceptible to creating a model that is perfected for our observed data but does not generalize well to new data. To assess how we perform on new data, we can score the model on both the old and new data and compare the model's performance, with the hope that it generalizes well to the new data. After lunch we'll introduce some techniques and other methods to improve our chances of performing well on new data.
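As a rough sketch of that idea (not part of the original exercise; the 70/30 split and random seed are arbitrary), we could hold out part of the baseball data to stand in for "new" observations and compare the in-sample error with the held-out error:
rng = np.random.RandomState(0)
idx = rng.permutation(len(baseball))
train, test = idx[:int(0.7*len(baseball))], idx[int(0.7*len(baseball)):]
X_hits = pd.DataFrame(baseball['Hits'])
y_sal = np.log(baseball['Salary'])
lm = linear_model.LinearRegression()
lm.fit(X_hits.iloc[train], y_sal.iloc[train])
print(np.mean((y_sal.iloc[train] - lm.predict(X_hits.iloc[train]))**2))   # error on the data used for fitting
print(np.mean((y_sal.iloc[test] - lm.predict(X_hits.iloc[test]))**2))     # error on the held-out data
If the held-out error is much larger than the in-sample error, the model is likely overfitting -- which is exactly the issue illustrated with the celebrity data below.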
Before we break for lunch though, let's take a look at a simulated dataset to see what we mean...
Situation
Imagine that last year a talent management company managed 400 celebrities and tracked how popular they were within the public eye, as well as various predictors for that metric. The company is now interested in managing a few new celebrities, but wants to sign only those stars that are above a certain 'popularity' threshold to maintain their image.
Our job is to predict how popular each new celebrity will be over the course of the coming year so that we can make the best decision about whom to manage. For this analysis we'll use a function l2_error to compare our errors on a training set and on a test set of celebrity data.
The variable celeb_data_old represents things we know about the previous batch of celebrities. Each row represents one celeb. Each column represents some tangible measure about them -- their age at the time, number of Twitter followers, voice squeakiness, etc. The specifics of what each column represents aren't important.
Similarly, popularity_old is a previous measure of the celebrities' popularity.
Finally, celeb_data_new represents the same information that we had from celeb_data_old, but for the new batch of internet wonders that we're considering.
How can we predict how popular the NEW batch of celebrities will be ahead of time so that we can decide who to sign? And are these estimates stable from year to year?
In [25]:
with np.load('data/mystery_data_old.npz') as data:
    celeb_data_old = data['celeb_data_old']
    popularity_old = data['popularity_old']
    celeb_data_new = data['celeb_data_new']
lmr3 = linear_model.LinearRegression()
lmr3.fit(celeb_data_old, popularity_old)
predicted_popularity_old = lmr3.predict(celeb_data_old)
predicted_popularity_new = lmr3.predict(celeb_data_new)
def l2_error(y_true, y_pred):
    """
    Calculate the L2 norm of the error (the square root of the
    sum of squared differences) between true ys and predicted ys.
    """
    diff = (y_true - y_pred)
    return np.sqrt(np.dot(diff, diff))
print "Predicted L2 Error:", l2_error(popularity_old, predicted_popularity_old)
In [26]:
with np.load('data/mystery_data_new.npz') as data:
    popularity_new = data['popularity_new']
print "Predicted L2 Error:", l2_error(popularity_new, predicted_popularity_new)
Something's not right... our model seems to be performing worse on this data! Our model performed so well on last year's data; why didn't it work on the data from this year?